The adaptive Gril estimator with a diverging number of parameters



Mohammed El Anbari and Abdallah Mkhadri 

Department of Mathematics, Faculty of Sciences Semlalia, 
Cadi Ayyad University, B.P. 2390 Marrakesh, Morocco 
and Dept. de mathematiques Batiment 425 
Universite Paris-Sud, 91405 Orsay Cedex. 

February 27, 2013 



Abstract 

We consider the problem of variable selection and estimation in the linear regression model
in situations where the number of parameters diverges with the sample size. We propose
the adaptive Generalized Ridge-Lasso (AdaGril), which is an extension of the adaptive
Elastic Net. AdaGril incorporates information redundancy among correlated variables for
model selection and estimation. It combines the strengths of the quadratic regularization and
the adaptively weighted Lasso shrinkage. In this paper, we highlight the grouped selection
property for the AdaCnet method (one type of AdaGril) in the equal correlation case. Under
weak conditions, we establish the oracle property of AdaGril, which ensures optimal large-sample
performance when the dimension is high. Consequently, it both handles the
problem of collinearity in high dimension and enjoys the oracle property. Moreover, we show
that the AdaGril estimator achieves a Sparsity Inequality, i.e., a bound in terms of the number
of non-zero components of the 'true' regression coefficient. This bound is obtained under a
weak Restricted Eigenvalue (RE) condition similar to the one used for the Lasso. Simulation studies show
that some particular cases of AdaGril outperform their competitors.

Keywords and phrases: Adaptive Regularization, Variable Selection, High Dimension, Oracle 
Property, Sparsity Inequality. 

1 Introduction 

We consider the problem of variable selection and estimation for the general linear regression model

$$y = X\beta^* + \varepsilon, \qquad (1)$$

where $y = (y_1,\dots,y_n)^t$ is an $n$-vector of responses, $X = (x_1,\dots,x_p)$ is an $n \times p$ design matrix of
$p$ predictor vectors of dimension $n$, $\beta^*$ is a $p$-vector of unknown parameters which are to be
estimated, $t$ stands for the transpose and $\varepsilon$ is an $n$-vector of (i.i.d.) random errors with mean 0
and variance $\sigma^2$. Without loss of generality we assume that the data are centered.

When $p$ is large, selecting a small number of predictors that contribute to the response
often leads to a parsimonious model. It amounts to assuming that $\beta^*$ is sparse in the sense that
$s < p$ components are non-zero. Denote the set of indices of the non-zero components by $\mathcal{A} = \{j : |\beta^*_j| \neq 0\}$. In this
setting, variable selection can improve both estimation accuracy and interpretation. Our goal
is to determine the set $\mathcal{A}$ and to estimate the corresponding true coefficients.

Sparsity is associated with high dimensional data, where the number of predictors $p$ is typically
comparable to or exceeds the sample size $n$. The problem occurs frequently in genomics and
proteomics studies, functional MRI, tumor classification and signal processing (cf. Fan and Li 2008).
In many of these applications, we would like to achieve both variable reduction and prediction 
accuracy. 

Variable selection for high dimensional data has received a lot of attention recently. In the 
last decade interest has focused on penalized regression methods which implement both variable 
selection and coefficient estimation in a single procedure. The best known of these procedures
are Lasso (Tibshirani 1996, Chen et al. 1998) and SCAD (Fan and Li 2001), which have
good computational and statistical properties. 

In fact, there has been a large and rapidly growing body of literature on the Lasso and SCAD
over the past few years. Osborne et al. (2000) derived the optimality conditions associated
with the Lasso solution. Some theoretical statistical aspects of the Lasso estimator of the
regression coefficients have been derived by Knight and Fu (2000) in the finite dimensional setting.
Many other extensions, with asymptotic and non-asymptotic results, can be found in Zhang and
Yu (2006) and Bunea et al. (2007), etc.

Various extensions and modifications of the Lasso have been proposed to ensure that on 
one hand, the variable selection process is consistent and on the other hand, the estimated 
regression coefficient has a fast rate of convergence. Fan and Li (2001) showed that the SCAD 
enjoys the oracle property, that is, the SCAD estimator can perform as well as the oracle if the 
penalization parameter is appropriately chosen. Fan and Peng (2004) studied the asymptotic 
behavior of SCAD when the dimensionality of the parameter diverges. Fan and Li (2001) showed 
that asymptotically the Lasso estimates produce non-ignorable bias. Zou (2006) showed that the
Lasso does not have the oracle property in the finite parameter setting, as conjectured in Fan and Li (2001).
Zhao and Yu (2008) established the same result for the $p > n$ case.

To overcome the bias problem of Lasso, Zou (2006) proposed the adaptive Lasso estimator 
(AdaLasso) defined by 

$$\hat\beta_{AdaLasso} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \hat w_j |\beta_j|, \qquad (2)$$

where the weights are $\hat w_j = (|\hat\beta_j|)^{-\gamma}$ ($j = 1,\dots,p$), $\gamma$ is a positive constant and $\hat\beta$ is an initial
consistent estimate of $\beta^*$. We recall here that $\hat\beta_{Lasso}$ is the solution to the same problem (2) in
which $\hat w_j = 1$ for all $j$.

The second major drawback of the Lasso (and also of AdaLasso or other $\ell_1$ penalization methods) is
its poor performance when there are highly correlated predictors. Under high dimensionality,
the situation is particularly dire. Zou and Hastie (2005) showed that the Lasso estimates are
unstable when predictors are highly correlated. They proposed the Elastic Net (Enet) for variable
selection, which combines $\ell_1$ and $\ell_2$ penalties. El Anbari and Mkhadri (2008) proposed a procedure
called Elastic Corr-Net (Cnet) which combines the $\ell_1$ penalty and the correlation based penalty
of Tutz and Ulbricht (2009). Daye and Jeng (2009) proposed a similar approach called
the Weighted Fusion (WFusion). These two approaches can incorporate information redundancy
among correlated predictors for estimation and variable selection. Numerical studies have shown
that Cnet and WFusion outperform the Lasso and Enet in certain situations. In the same setting,
Hebiri and van De Geer (2010) considered the Smooth-Lasso procedure (S-Lasso), a modification
of the Fused-Lasso procedure (Tibshirani et al. 1998), in which the second $\ell_1$ fused penalty is
replaced by the smooth $\ell_2$ norm penalty. The general formulation encompassing the four latter
approaches, called the Generalized Ridge Lasso (Gril) estimator, can be defined by

$$\hat\beta_{Gril}(\lambda_1, \lambda_2) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\,\beta^t Q \beta, \qquad (3)$$

where $Q$ is a positive semi-definite matrix. A similar formulation was cited in Daye and Jeng
(2009) and Hebiri and van De Geer (2010) for regression problems and in Clemmensen et al. (2008)
for classification problems. Moreover, the estimates of the parameters of the Gril
procedure can be computed efficiently via a modification of the LARS algorithm (Efron et al. 2004).
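As a small illustration of criterion (3), the following Python sketch evaluates the Gril objective for a given coefficient vector. The function name `gril_objective` and the random toy data are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def gril_objective(beta, X, y, Q, lam1, lam2):
    """Value of ||y - X beta||_2^2 + lam1 * ||beta||_1 + lam2 * beta' Q beta."""
    resid = y - X @ beta
    return float(resid @ resid + lam1 * np.abs(beta).sum() + lam2 * (beta @ Q @ beta))

# Toy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
beta = rng.standard_normal(5)

# Q = identity gives the Elastic Net criterion; lam2 = 0 gives the Lasso criterion.
enet_value = gril_objective(beta, X, y, np.eye(5), lam1=1.0, lam2=0.5)
lasso_value = gril_objective(beta, X, y, np.eye(5), lam1=1.0, lam2=0.0)
```

With $Q = I$ the two values differ exactly by $\lambda_2\|\beta\|_2^2$, which is the ridge part of the Elastic Net penalty.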

The Gril estimator (Enet in particular) resolves the collinearity problem of Lasso, and the AdaLasso
estimator possesses the oracle property of SCAD. However, in the high dimensional setting, the Gril
lacks the oracle property, while the AdaLasso estimates are unstable because of the bias problem of
Lasso. Recently, Zou and Zhang (2009) proposed the adaptive Elastic Net (AdaEnet) that
combines the strengths of the $\ell_2$ norm and the adaptively weighted $\ell_1$ shrinkage. They established
the oracle property of the AdaEnet when the dimension diverges with the sample size. Independently,
Ghosh (2007) proposed the same AdaEnet, but he focused specifically on the grouped
selection property of AdaEnet along with its model selection complexity.

Despite its popularity, Enet (and also AdaEnet) has been criticized for being inadequate, notably
in situations in which additional structural knowledge about predictors should be taken into
account (cf. Bondell and Reich 2008, El Anbari and Mkhadri 2008, Daye and Jeng 2009, Hebiri
and van De Geer 2010, Slawski et al. 2010 and She 2010). To this end, these authors complement
$\ell_1$ regularization with a second regularization term based on the total variation or on a quadratic penalty.
The former aims at the explicit inclusion of structural knowledge about predictors, while the
latter aims at taking into account some type of correlation between predictors. The experimental
results of these alternatives have shown that Enet performs worse in grouping highly correlated
predictors. But, similar to Enet, these new estimators are asymptotically biased because of
the $\ell_1$ component in the penalty, and they cannot achieve selection consistency and estimation
efficiency simultaneously.

Therefore, there is a need to develop methods that take into account additional structural
information about the predictors and have the oracle property. In the same spirit as AdaEnet, we propose
the adaptive Gril (AdaGril), which penalizes the least squares loss using a mixture of a weighted $\ell_2$
norm and the adaptively weighted $\ell_1$ penalty. We first highlight the grouped selection property
for the AdaCnet method (one type of AdaGril) in the equal correlation case, meaning that it selects
or drops highly correlated predictors together. Under weak conditions, as in Zou and Zhang
(2009), we study its asymptotic properties when the dimension diverges with the sample size.
In particular, we show that the AdaGril enjoys the oracle property with a diverging number
of predictors. Moreover, we show that the AdaGril estimator achieves a Sparsity Inequality, i.e.,
a bound in terms of the number of non-zero components of the 'true' regression coefficient.
This bound is obtained under a weak Restricted Eigenvalue (RE) condition similar to the one used for
the Lasso. Finally, a detailed experimental performance comparison of different Gril estimators is
considered.

In Section 2, we focus on the Cnet method and briefly sketch other Gril estimators. A computational
algorithm to approximate their solutions is presented, and we briefly summarize some of their
statistical properties obtained in the fixed dimensional setting. In Section 3, we define the adaptive
Gril estimator and begin by showing the grouping effect of AdaCnet in the equal
correlation case. Then, we establish the asymptotic statistical theory of the AdaGril when the
dimension diverges, including the oracle property. We end by showing that AdaGril achieves a
Sparsity Inequality. Computational aspects of the adaptive Gril are discussed in Section 4. A detailed
simulation study is performed in Section 5, which illustrates the performance of three particular
cases of Gril and AdaGril estimators in relation to the AdaEnet estimator. A brief discussion is given
in Section 6. All technical proofs are provided in Section 7.

2 Different Gril estimators 

In this section, we briefly introduce our alternative to Enet, called the Elastic
Corr-Net (Cnet), which takes into account the correlation between predictors in the quadratic
penalty. Two other competing Gril estimators are presented and their statistical properties are
summarized.

2.1 Doubly regularized techniques 

Suppose that the predictors are $x_i = (x_{i1},\dots,x_{ip})$ and the response values are $y_i$, for $i = 1,\dots,n$.
Apart from its lack of consistency, it is well known that Lasso has two limitations:

a) Lasso does not encourage grouped selection in the presence of highly correlated covariates, and

b) in the $p > n$ case, Lasso can select at most $n$ covariates.

To overcome these limitations, Zou and Hastie (2005) proposed the elastic net, which combines both
ridge ($\ell_2$) and Lasso ($\ell_1$) penalties. So, the Enet procedure corresponds to the Gril estimator
with $Q = I_p$, where $I_p$ is the $p \times p$ identity matrix.

Despite its popularity, Enet (and also AdaEnet) has been criticized for being inadequate, notably
in situations in which additional structural knowledge about predictors should be taken into
account (cf. Bondell and Reich 2008, El Anbari and Mkhadri 2008, Daye and Jeng 2009, Hebiri
and van De Geer 2010, Slawski et al. 2010 and She 2010). To this end, these authors complement
$\ell_1$ regularization with a second regularization term based on the total variation or on a quadratic penalty.
The former aims at the explicit inclusion of structural knowledge about predictors, while the
latter aims at taking into account some type of correlation between predictors. One example of
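the latter is the Elastic Corr-Net (Cnet) (El Anbari and Mkhadri 2008), which is a modification
of Enet in which the ridge penalty term is replaced by the correlation based penalty term $P_c(\beta)$
defined by

$$P_c(\beta) = \sum_{i=1}^{p-1}\sum_{j>i}\left\{\frac{(\beta_i-\beta_j)^2}{1-\rho_{ij}} + \frac{(\beta_i+\beta_j)^2}{1+\rho_{ij}}\right\},$$

where $\rho_{ij} = x_i^t x_j$ denotes the (empirical) correlation between the $i$th and the $j$th predictors.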
The correlation based penalty $P_c(\beta)$, introduced by Tutz and Ulbricht (2009), encourages a
grouping effect for highly correlated variables. This penalty can be written in the simple quadratic
form $P_c(\beta) = \beta^t W \beta$, where $W = (w_{ij})_{1 \le i,j \le p}$ is the matrix with entries

$$w_{ij} = \begin{cases} 2\sum_{s \neq i}\dfrac{1}{1-\rho_{is}^2} & \text{for } i = j,\\[4pt] -\dfrac{2\rho_{ij}}{1-\rho_{ij}^2} & \text{for } i \neq j. \end{cases} \qquad (4)$$

Hence, Cnet is a particular Gril estimator with the weight matrix $Q = W$ defined by (4). Cnet
provided a good performance in simulations and real applications, especially for highly correlated
predictors.
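The equivalence between the pairwise form of the correlation based penalty and a quadratic form can be checked numerically. The sketch below assumes the Tutz and Ulbricht form $\sum_{i<j}\{(\beta_i-\beta_j)^2/(1-\rho_{ij}) + (\beta_i+\beta_j)^2/(1+\rho_{ij})\}$ and builds the corresponding matrix by expanding the squares; the function names `cnet_penalty` and `cnet_matrix` and the toy data are hypothetical.

```python
import numpy as np

def cnet_penalty(beta, R):
    """Pairwise form of the correlation based penalty (assumed Tutz-Ulbricht form)."""
    p = len(beta)
    total = 0.0
    for i in range(p - 1):
        for j in range(i + 1, p):
            total += (beta[i] - beta[j]) ** 2 / (1.0 - R[i, j])
            total += (beta[i] + beta[j]) ** 2 / (1.0 + R[i, j])
    return total

def cnet_matrix(R):
    """Matrix W such that beta' W beta equals the pairwise penalty.

    Requires |R[i, j]| < 1 for all i != j.
    """
    # Adding the identity only changes the diagonal and avoids dividing by zero there.
    inv = 1.0 / (1.0 - R ** 2 + np.eye(R.shape[0]))
    np.fill_diagonal(inv, 0.0)
    W = -2.0 * R * inv                           # off-diagonal: -2 rho_ij / (1 - rho_ij^2)
    np.fill_diagonal(W, 2.0 * inv.sum(axis=1))   # diagonal: 2 * sum_{s != i} 1 / (1 - rho_is^2)
    return W

# Toy check on an empirical correlation matrix (hypothetical data).
rng = np.random.default_rng(1)
R = np.corrcoef(rng.standard_normal((50, 4)), rowvar=False)
beta = np.array([1.0, -2.0, 0.5, 3.0])
W = cnet_matrix(R)
pairwise_value = cnet_penalty(beta, R)
quadratic_value = float(beta @ W @ beta)
```

The two values agree up to rounding, so a Gril solver only needs the matrix `W` to handle the Cnet penalty.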

We can also mention the Weighted Fusion (WFusion) (Daye and Jeng 2009) and the Smooth-Lasso
(S-Lasso) (Hebiri and van de Geer 2010) as alternatives to Cnet. In the former, the
correlation based penalty is replaced by a modified weighted penalty $\tilde P_c(\beta) = \sum_{j>i} w_{ij}(\beta_i - s_{ij}\beta_j)^2$,
where $w_{ij} = |\rho_{ij}|^{\gamma}/(1 - |\rho_{ij}|)$, $s_{ij} = \mathrm{sgn}(\rho_{ij})$ is the sign of $\rho_{ij}$ and $\gamma > 0$ is a tuning
parameter. The latter is a modification of the Fused-Lasso procedure (Tibshirani et al.
1998), in which the second $\ell_1$ fused penalty is replaced by the smooth $\ell_2$ norm penalty. This
quadratic term helps to tackle situations where the regression vector is structured such that
its coefficients vary slowly. Surprisingly, this simple modification leads to good performance,
especially when the regression vector is 'smooth', i.e., when the variations between successive
coefficients of the unknown regression parameter are small.



2.2 A computational algorithm 

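In this section we propose a modification of the Elastic-Net algorithm for finding a solution of the
penalized least squares problem (3) of the Gril. The main idea is to transform the Gril problem
into an equivalent Lasso problem on augmented data (cf. Zou and Hastie, 2005). Let

$$\tilde X_{(n+p)\times p} = \begin{pmatrix} X \\ \sqrt{\lambda_2}\,L^t \end{pmatrix}, \quad \tilde y_{(n+p)} = \begin{pmatrix} y \\ 0 \end{pmatrix} \quad \text{and} \quad \tilde\varepsilon_{(n+p)} = \begin{pmatrix} \varepsilon \\ -\sqrt{\lambda_2}\,L^t\beta^* \end{pmatrix},$$

where $Q$ is a real symmetric positive semi-definite square matrix with Choleski decomposition
$Q = LL^t$ and $L = Q^{1/2}$. The Gril estimator is then defined as

$$\hat\beta = \arg\min_{\beta} \|\tilde y - \tilde X\beta\|_2^2 + \lambda_1\|\beta\|_1.$$

The latter result is a consequence of simple algebra, since $\|\tilde y - \tilde X\beta\|_2^2 = \|y - X\beta\|_2^2 + \lambda_2\beta^tQ\beta$,
and it motivates the following comments on the Gril method.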

Remark 1. The Gril estimates can be computed via the Lasso modification of the LARS
algorithm. For a fixed $\lambda_2$, it constructs at each step, which corresponds to a value of $\lambda_1$, an
estimator based on the correlation between the covariates and the current residual. Then, for a fixed
$\lambda_2$, we obtain the evolution of the Gril coefficient values as $\lambda_1$ varies. This provides
the coefficient regularization paths of the Gril estimator, which are piecewise linear (Efron et al.,
2004). Consequently, the Gril algorithm requires the same order of magnitude of computational
effort as the OLS estimate via the Lasso modification of the LARS algorithm.

Remark 2. If $p > n$, it is well known that LARS and its Lasso version can select at most
$n$ variables. Now, when LARS is applied to the augmented data
$(\tilde y, \tilde X)$, the Lasso modification of the LARS algorithm is able to select all $p$ predictors in
all situations. So this limitation of the Lasso is easily surmounted. Moreover, variable
selection is performed in a fashion similar to the Lasso.
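The augmented-data construction of this section can be checked numerically. The sketch below is illustrative only: the helper names `psd_factor` and `augment` are hypothetical, and the factor $L$ with $Q = LL^t$ is obtained from an eigendecomposition rather than a Choleski routine, an assumption made here so that singular positive semi-definite $Q$ is also covered.

```python
import numpy as np

def psd_factor(Q):
    """Return L with Q = L L^t, computed from the eigendecomposition of Q."""
    vals, vecs = np.linalg.eigh(Q)
    vals = np.clip(vals, 0.0, None)   # remove tiny negative eigenvalues due to rounding
    return vecs * np.sqrt(vals)       # scales column k of vecs by sqrt(vals[k])

def augment(X, y, Q, lam2):
    """Stack sqrt(lam2) * L^t under X and p zeros under y."""
    L = psd_factor(Q)
    X_aug = np.vstack([X, np.sqrt(lam2) * L.T])
    y_aug = np.concatenate([y, np.zeros(X.shape[1])])
    return X_aug, y_aug

# Toy check (hypothetical data): the augmented residual sum of squares equals
# the original one plus lam2 * beta' Q beta.
rng = np.random.default_rng(2)
X = rng.standard_normal((15, 4))
y = rng.standard_normal(15)
A = rng.standard_normal((4, 4))
Q = A @ A.T
beta = rng.standard_normal(4)
lam2 = 0.7

X_aug, y_aug = augment(X, y, Q, lam2)
lhs = float(np.sum((y_aug - X_aug @ beta) ** 2))
rhs = float(np.sum((y - X @ beta) ** 2) + lam2 * beta @ Q @ beta)
```

Because the identity holds for every `beta`, any Lasso solver run on `(X_aug, y_aug)` with penalty parameter $\lambda_1$ minimizes the Gril criterion (3).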



2.3 Statistical properties of Different Gril estimators 

The model is assumed to be sparse, i.e. most of the regression coefficients in $\beta^*$ are exactly zero,
corresponding to predictors that are irrelevant to the response. Without loss of generality, we
assume that the first $q$ components of the vector $\beta^*$ are non-zero. We briefly summarize in this
section the classical model selection consistency properties of particular Gril estimators.

Yuan and Lin (2007) were the first to give a necessary and sufficient condition on the generating
covariance matrices for the Elastic Net to select the true model when $q$ and $p$ are fixed. This condition
is called the Elastic Irrepresentable Condition (EIC); it is an extension of the Irrepresentable
Condition (IC), defined in Zhao and Yu (2006), for the Lasso's model selection consistency. For
general scaling of $q$, $p$ and $n$, Jia and Yu (2010) give conditions on the relationship between $q$, $p$
and $n$ such that EIC guarantees the Elastic Net's model selection consistency. Moreover, they
showed that EIC is weaker than IC. In the same spirit, consistency properties and asymptotic
normality are established when $p < n$ for WFusion (Daye and Jeng 2009). For the high dimensional
setting $p > n$, Hebiri and van De Geer (2010) recently established variable selection consistency
results for their Quadratic estimator, which corresponds exactly to our Gril estimator. They
showed that the Gril estimator achieves a Sparsity Inequality, i.e., a bound in terms of the number
of non-zero components of the 'true' regression vector. The latter result for n > ^3 is extended to
the AdaGril estimator in the next section, and its oracle properties are detailed when $p$ diverges.

3 The adaptive Gril estimator 

Now a revised version of the Gril estimator, called AdaGril, is proposed by incorporating adaptive
weights in the $\ell_1$ penalty of equation (3). So, AdaGril is a combination of Gril and AdaLasso.

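We first assume that $\tilde\beta$ is an initial estimator of $\beta^*$ which is root-$n$-consistent. For example,
we can choose $\hat\beta_{OLS}$ or $\hat\beta_{Gril}$, and we construct the weights by

$$\hat w_j = (|\tilde\beta_j|)^{-\gamma}, \quad j = 1,\dots,p. \qquad (6)$$
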
Now, it is clear that AdaGril combines the strengths of Ridge regression and AdaLasso. So,
AdaGril will avoid both the problem of collinearity and the bias problem of Lasso in the high dimensional
setting. The tuning parameters $\lambda_1^*$ and $\lambda_1$ are directly responsible for the sparsity of the estimates and
are allowed to be different, whereas the same value of $\lambda_2$ is used for the Gril and AdaGril estimators,
because the quadratic norm in the $\ell_2$ penalty leads to the same kind of contribution in both
estimators.
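A minimal sketch of how adaptive weights of the form $(|\tilde\beta_j|)^{-\gamma}$, as in the AdaLasso weights of (2), might be computed from an initial estimate. The function name `adaptive_weights` and the optional guard `eps` are assumptions of this example and are not part of the paper.

```python
import numpy as np

def adaptive_weights(beta_init, gamma, eps=0.0):
    """Weights |beta_init_j|^(-gamma).

    A coefficient that is exactly zero receives an infinite weight (the variable
    is then excluded) unless a small eps > 0 is added; eps is a hypothetical
    safeguard introduced for this sketch.
    """
    magnitude = np.abs(np.asarray(beta_init, dtype=float)) + eps
    with np.errstate(divide="ignore"):
        return magnitude ** (-gamma)

# Large initial coefficients get small weights and are penalized less.
w = adaptive_weights([2.0, -0.5, 0.0], gamma=1.0)
w_flat = adaptive_weights([2.0, -0.5], gamma=0.0)   # gamma = 0: all weights equal 1
```

Setting `gamma=0` gives unit weights, in which case the adaptive penalty reduces to the ordinary $\ell_1$ penalty.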

3.1 The grouping effect of AdaCnet 

Grouping effect is expressed when the regression coefficients of a group of highly correlated 
variables tend to be equal (up to a change of sign if negatively correlated). Similar to Cnet 
estimator, the AdaCnet estimator has the natural tendency of grouping each pair of regression 
coefficients according to their correlations. We establish in the following lemma the grouping 
effect of AdaCnet in the case of equal correlations. 





Then, the adaptive Gril estimates are defined by 




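(7)

Lemma 3.1. Given data $(y, X)$, where $X = (x_1|\dots|x_p)$, and parameters $(\lambda_1^*, \lambda_2)$, assume that the response is
centered and the predictors $X$ are standardized. Let $\hat\beta(\lambda_1^*, \lambda_2)$ be the AdaCnet estimate.
If $\hat\beta_i(\lambda_1^*,\lambda_2)\hat\beta_j(\lambda_1^*,\lambda_2) > 0$ and $\rho_{kl} = \rho$ for all $(k,l)$, then

$$\frac{1}{\|y\|_2}\left|\hat\beta_j - \hat\beta_i\right| \le \frac{1-\rho^2}{2(p+\rho-1)\lambda_2}\left[\sqrt{2(1-\rho)} + \frac{\gamma\lambda_1^*\,|\tilde\beta_i - \tilde\beta_j|}{\|y\|_2\,\min(|\tilde\beta_i|,|\tilde\beta_j|)^{\gamma+1}}\right].$$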

Remark 3. We note that $\gamma = 0$ leads to the grouping effect of the Cnet as a special case. We
also observe that the grouping effect has contributions not only from the quadratic type penalty
but also from the $L_1$ type adaptive penalty. However, if $\lambda_2 \to 0$, then it is not possible to capture
any grouping effect from the $L_1$ type adaptive penalty alone. Moreover, when considering $\tilde\beta_j$ as
univariate OLS estimates with $\min(|\tilde\beta_i|, |\tilde\beta_j|) > 1$, the latter bound yields

$$\frac{1}{\|y\|_2}\left|\hat\beta_j - \hat\beta_i\right| \le \frac{1-\rho^2}{2(p+\rho-1)\lambda_2}\,(2+\gamma\lambda_1^*)\sqrt{2(1-\rho)}.$$


3.2 Model selection consistency for AdaGril when p diverges 

The oracle properties of the adaptive Elastic Net are provided in Ghosh (2007) for $p < n$. A
more detailed and elaborate discussion of the oracle properties of the adaptive Elastic Net is
provided in Zou and Zhang (2009). In this section, as in Zou and Zhang (2009), we establish
the oracle properties of the AdaGril estimator when $p$ diverges (i.e. $p(n) = O(n^{\nu})$, $0 \le \nu < 1$).

Moreover, we provide a mean sparsity inequality, that is, a bound on the
mean squared risk that takes into account the sparsity of the oracle regression vector $\beta^*$.



3.2.1 Mean Sparsity Inequality 

Now we establish the mean sparsity inequality achieved by the AdaGril estimator. For this
purpose, we need the following assumption on the minimum and the maximum eigenvalues of
the positive semi-definite matrices $X^tX$ and $Q$.

(C1) Let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the minimum and the maximum eigenvalues of a
positive semi-definite matrix $M$, respectively. Then we assume

$$b \le \lambda_{\min}\left(\tfrac{1}{n}X^tX\right) \le \lambda_{\max}\left(\tfrac{1}{n}X^tX\right) \le B$$

and

$$d \le \lambda_{\min}(Q) \le \lambda_{\max}(Q) \le D,$$

where $b$, $B$, $d$ and $D$ are constants such that $b, B > 0$ and $d, D \ge 0$.

Now, given the data $(y, X)$, let $\hat w = (\hat w_1,\dots,\hat w_p)$ be a vector whose components are all
non-negative and can depend on $(y, X)$. Define

$$\hat\beta_{\hat w}(\lambda_2, \lambda_1^*) = \arg\min_{\beta}\left\{\|y - X\beta\|_2^2 + \lambda_2\beta^tQ\beta + \lambda_1^*\sum_{j=1}^{p}\hat w_j|\beta_j|\right\}$$

for non-negative parameters $\lambda_1^*$ and $\lambda_2$. If $\hat w_j = 1$ for all $j$, we denote $\hat\beta_{\hat w}(\lambda_2,\lambda_1^*)$ by $\hat\beta(\lambda_2,\lambda_1^*)$
for convenience. Assumption (C1) assumes a reasonably good behavior of both the predictor
and the weight matrices (cf. Portnoy 1984).

Theorem 3.2. If we assume the model (1) and Assumption (C1), then

$$E\left(\|\hat\beta_{\hat w}(\lambda_2,\lambda_1^*) - \beta^*\|_2^2\right) \le 4\,\frac{\lambda_2^2D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^{*2}\,E\left(\sum_{j=1}^{p}\hat w_j^2\right)}{(bn + \lambda_2 d)^2}.$$

In particular, when $\hat w_j = 1$ for all $j$ and if we denote $\lambda_1^*$ by $\lambda_1$, we obtain the mean sparsity
inequality for the Gril estimator:

$$E\left(\|\hat\beta(\lambda_2,\lambda_1) - \beta^*\|_2^2\right) \le 4\,\frac{\lambda_2^2D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2\,p}{(bn + \lambda_2 d)^2}.$$

The risk bounds in Theorem 3.2 are non-asymptotic. They imply that, under assumptions
(C1)-(C6) defined below, $\hat\beta(\lambda_2, \lambda_1)$ is a root-$(n/p)$-consistent estimator (cf. Fan and Peng (2004)
for SCAD and Zou and Zhang (2009) for AdaEnet). So, the construction of the adaptive weights
by using the Gril is appropriate.
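To see the root-$(n/p)$ rate at work, the right-hand side of the bound can be evaluated numerically. The sketch below assumes the bound reads $4\,[\lambda_2^2D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^{*2}E(\sum_j \hat w_j^2)]/(bn+\lambda_2 d)^2$; the function name `mean_risk_bound` and all numerical values are hypothetical.

```python
def mean_risk_bound(lam1, lam2, n, p, sigma, b, B, d, D, beta_norm_sq, expected_sum_w2):
    """Right-hand side of the assumed mean sparsity bound.

    beta_norm_sq is ||beta*||_2^2 and expected_sum_w2 is E(sum_j w_j^2),
    which equals p when all weights are 1.
    """
    numerator = (lam2 ** 2 * D ** 2 * beta_norm_sq
                 + B * p * n * sigma ** 2
                 + lam1 ** 2 * expected_sum_w2)
    return 4.0 * numerator / (b * n + lam2 * d) ** 2

# Without penalties the bound reduces to 4 B sigma^2 p / (b^2 n), i.e. order p / n:
# multiplying n by 4 divides the bound by 4.
common = dict(p=10, sigma=1.0, b=0.5, B=2.0, d=0.0, D=1.0,
              beta_norm_sq=4.0, expected_sum_w2=10.0)
bound_n100 = mean_risk_bound(0.0, 0.0, n=100, **common)
bound_n400 = mean_risk_bound(0.0, 0.0, n=400, **common)
```

The penalty terms contribute $\lambda_2^2$ and $\lambda_1^{*2}$ to the numerator, so the $p/n$ order is preserved only when both grow slowly enough relative to $n$.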

3.2.2 Oracle properties 

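To establish the oracle properties, we need the following assumptions, which are the same as those used in Zou and
Zhang (2009).

(C2) $\lim_{n\to\infty} \dfrac{\max_{i=1,\dots,n}\sum_{j=1}^{p}x_{ij}^2}{n} = 0$.

(C3) $E\left[|\varepsilon|^{2+\delta}\right] < \infty$ for some $\delta > 0$.

(C4) $\lim_{n\to\infty} \dfrac{\log(p)}{\log(n)} = \nu$ for some $0 \le \nu < 1$.

To construct the adaptive weights $\hat w$, we take a fixed $\gamma > \frac{2\nu}{1-\nu}$. In our numerical studies
we let $\gamma = \lceil\frac{2\nu}{1-\nu}\rceil + 1$ to avoid tuning on $\gamma$, as in Zou and Zhang (2009). Once $\gamma$ is chosen, we
choose the regularization parameters according to the following conditions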

(C5) ^ ^ 

lim -^Sl = 0, for all i = 1, ■■■,p, lim — L = 0, 

and 

A^ (1— 

lim —= = 0, lim —;=n 2 = 00. 

n->oo y^n n-)-oo 



(C6) 

li-4^/E^=0' lim f^l'^^=0, lim min(-^,f^V)(min 1/3*1) ^00. 

n->oo^W^^J n^oo\n J Xf n^oo Xi^'\^XlJ ^JS^'^^'^ 

Theorem 3.3. Let us write $\beta^* = (\beta^*_{\mathcal{A}}, 0)$ and define

$$\tilde\beta_{\mathcal{A}} = \arg\min_{\beta}\left\{\|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda_2\beta^tQ_{\mathcal{A}}\beta + \lambda_1^*\sum_{j\in\mathcal{A}}\hat w_j|\beta_j|\right\}. \qquad (8)$$

Then, under the assumptions (C1)-(C6) and with probability tending to 1, $(N_{\mathcal{A}}\tilde\beta_{\mathcal{A}}, 0)$ is the solution
to (7).

Theorem 3.3 provides an asymptotic characterization of the solution to the adaptive Gril
criterion. It demonstrates that the adaptive Gril estimator is as efficient as an oracle one.
Moreover, it is helpful in the proof of Theorem 3.4 below.

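Theorem 3.4. Under conditions (C1)-(C6), the adaptive Generalized Ridge Lasso has the oracle
property; that is, the estimator $\hat\beta(AdaGril)$ must satisfy:

1. Consistency in selection: $\Pr\left(\{j : \hat\beta(AdaGril)_j \neq 0\} = \mathcal{A}\right) \to 1$,

2. Asymptotic normality: $\alpha^t\Sigma_{\mathcal{A}}^{1/2}\left(I + \lambda_2\Sigma_{\mathcal{A}}^{-1}Q_{\mathcal{A}}\right)N_{\mathcal{A}}^{-1}\left(\hat\beta(AdaGril)_{\mathcal{A}} - \beta^*_{\mathcal{A}}\right) \to_d N(0, \sigma^2)$, where
$X_{\mathcal{A}}$, $Q_{\mathcal{A}}$ and $N_{\mathcal{A}}$ are sub-matrices obtained by extracting the columns of $X$, $Q$ and $N$
respectively according to the indices in $\mathcal{A}$, $\Sigma_{\mathcal{A}} = X_{\mathcal{A}}^tX_{\mathcal{A}}$ and $\alpha$ is a vector of norm 1.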

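Theorem 3.4 provides the selection consistency and asymptotic normality of AdaGril when
the number of parameters diverges. So, the AdaGril estimator enjoys the oracle property of SCAD
in the high dimensional setting. As a first special case, taking $Q = I$, we obtain the asymptotic
normality of the adaptive Elastic Net:

$$\alpha^t\Sigma_{\mathcal{A}}^{1/2}\left(I + \lambda_2\Sigma_{\mathcal{A}}^{-1}\right)N_{\mathcal{A}}^{-1}\left(\hat\beta(AdaEnet)_{\mathcal{A}} - \beta^*_{\mathcal{A}}\right) \to_d N(0, \sigma^2).$$

Taking $\lambda_2 = 0$, we obtain the asymptotic normality of the adaptive Lasso as a second special
case:

$$\alpha^t\Sigma_{\mathcal{A}}^{1/2}\left(\hat\beta(AdaLasso)_{\mathcal{A}} - \beta^*_{\mathcal{A}}\right) \to_d N(0, \sigma^2).$$

3.3 Sparsity inequality for AdaGril estimator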

Now we establish a sparsity inequality (SI) achieved by $\hat\beta(AdaGril)$, that is, a bound on the $L_2$ and
$L_1$ estimation errors in terms of the number of non-zero components of the 'true' coefficient
vector $\beta^*$. Here the second parameter $\lambda_2$ is not free, but depends on the parameter $\lambda_1^*$, which is
fixed as a function of $(n, p, \sigma)$. Moreover, the Gril estimator (instead of the OLS or Lasso estimator)
is used as the initial estimator for the adaptive Gril method. Finally, our sparsity
inequalities are obtained under an assumption on the Gram matrix similar to the one used for the Lasso (cf.
Bickel et al. 2009). Let us now state the assumptions needed.

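Assumption RE. There is a constant $\psi > 0$ such that, for any $z$ that satisfies
$\sum_{j\in\mathcal{A}^c}|z_j| \le 4\max(1/\eta, 1)\sum_{j\in\mathcal{A}}|z_j|$, we have

$$z^tKz \ge \psi\sum_{j\in\mathcal{A}}z_j^2,$$

where $K = X^tX + \lambda_2Q$ and $\eta = \min_{j\in\mathcal{A}}(|\beta_j^*|)$.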



Let $\hat\beta(Gril)$ and $\hat\beta(AdaGril)$ denote the Gril estimator and the adaptive Gril estimator, respectively.
Here, the weights of the adaptive estimator are estimated from the Gril estimator $\hat\beta(Gril)$:

$$\hat w_j = \max\left(\frac{1}{|\hat\beta(Gril)_j|},\, 1\right) \quad \text{for all } j = 1,\dots,p. \qquad (9)$$

We note that Zhou et al. (2009) have also considered the same weights in their analysis of the
adaptive Lasso for high dimensional regression and Gaussian graphical models. These weights
are easier to manipulate in the proof of the next theorem than the classical weights (6).

Theorem 3.5. Given data $(y, X)$, let $s = |\mathcal{A}|$, $\eta = \min_{j\in\mathcal{A}}(|\beta_j^*|)$ and $\varphi \in (0, 1)$. We define our
tuning parameters $\lambda_1^*$ and $\lambda_2$ as follows



Xl = 8V2a^M^ and X, - ^* 



n 8||Q/3*||oo 

Now, let S = 8'ip~^Xls. and for < 6 < 1, consider the set F = {maxj=i^...^p2|C/j| < 9Xi} with 
Uj = X]r=i ■ Therefore, if assumption RE holds and in addition i] > 25, then with 
probability greater than 1 — ip on the T set, we have 



- :K.(3{AdaGril)g < 4^~Uf ( : 



■ 2 ]V 

max < - , 1 > s 



- ^{AdaGril)\\i < S^/^-^A* (^max l|) s. 

The Restricted Eigenvalue (RE) Assumption is widely used in the literature on the variable
selection consistency of $\ell_1$-penalized regression methods in high dimension ($p \gg n$; see for
instance Bickel et al. 2009, Zhou et al. 2009 and Hebiri and van De Geer 2010). On the one
hand, the main differences between our RE assumption and that in Bickel et al. (2009) are the
matrix $K = X^tX + \lambda_2Q$ and the specified constant $4\max(1/\eta, 1)$, used instead of $K_n = n^{-1}X^tX$ and
an arbitrary constant, respectively. On the other hand, there is a minor difference with the
assumption $B(Q)$ used in Hebiri and van De Geer (2010). Indeed, the latter authors only need
to consider the vectors $z$ such that $\sum_{j\in\mathcal{A}^c}|z_j| \le \beta_n\sum_{j\in\mathcal{A}}|z_j|$, where $\beta_n$ is a scalar which depends
on $(s, \lambda_1^*, \lambda_2, \beta^*)$. Moreover, our choice of regularization parameters $(\lambda_1^*, \lambda_2)$ is relatively similar
to that used by Hebiri and van De Geer (2010) in Corollary 1 for the sparsity inequality of the Gril
estimator (called the Quadratic estimator in that paper). We refer the reader to the latter
reference for more discussion of that choice.



4 Computation and tuning parameters selection 

In this section we propose a modification of the Gril algorithm for finding a solution of the
penalized least squares problem (7) of the AdaGril. The main idea is to transform the AdaGril
problem into an equivalent Gril problem on the augmented data (cf. Zou and Hastie, 2005). The
main steps of the AdaGril algorithm are as follows:

1. Input: the matrix $X$ and the weights $\hat w$.

2. Put $x_j^* = x_j/\hat w_j$ for $j = 1,\dots,p$.

3. Use the Gril algorithm described in Section 2.2 to compute the AdaGril estimator
$\hat\beta(AdaGril)$.
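The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the criterion is taken to be $\|y - X\beta\|_2^2 + \lambda_2\beta^tQ\beta + \lambda_1^*\sum_j \hat w_j|\beta_j|$ without any additional rescaling, a plain cyclic coordinate descent replaces the LARS algorithm, and $Q$ is rescaled by the same weights as the columns of $X$ so that the transformed problem is exactly a Gril problem for general $Q$. All function names are hypothetical, and the weights must be finite and strictly positive.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                                  # residual y - X b
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]                   # remove coordinate j from the fit
            b[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def adagril(X, y, Q, w, lam1, lam2):
    """Rescale by the weights, augment the data, solve a Lasso, and map back."""
    Xs = X / w                                    # step 2: x_j* = x_j / w_j
    Qs = Q / np.outer(w, w)                       # matching rescaling of Q (assumption)
    vals, vecs = np.linalg.eigh(Qs)
    L = vecs * np.sqrt(np.clip(vals, 0.0, None))  # Qs = L L^t
    X_aug = np.vstack([Xs, np.sqrt(lam2) * L.T])
    y_aug = np.concatenate([y, np.zeros(X.shape[1])])
    theta = lasso_cd(X_aug, y_aug, lam1)
    return theta / w                              # beta_j = theta_j / w_j

def adagril_objective(beta, X, y, Q, w, lam1, lam2):
    resid = y - X @ beta
    return float(resid @ resid + lam2 * beta @ Q @ beta + lam1 * np.sum(w * np.abs(beta)))

# Toy example (hypothetical data and weights).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(40)
Q = np.eye(6)
w = np.array([0.5, 2.0, 2.0, 0.5, 2.0, 2.0])
lam1, lam2 = 5.0, 1.0
beta_hat = adagril(X, y, Q, w, lam1, lam2)
```

Substituting $\beta_j = \theta_j/\hat w_j$ turns the weighted $\ell_1$ penalty into an unweighted one, which is why an ordinary Lasso solver on the transformed and augmented data suffices.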

In practice, it is important to select appropriate tuning parameters $(\lambda_1, \lambda_2, \gamma)$ in order to
obtain good prediction accuracy. The tuning parameters can be chosen by minimizing
an estimate of the out-of-sample prediction error. If a validation set is available, this error can be
estimated directly. Lacking a validation set, one can use ten-fold cross-validation. Note that
we take a fixed $\gamma = \lceil\frac{2\nu}{1-\nu}\rceil + 1$ to avoid tuning on $\gamma$. There are then two tuning parameters left in
the AdaGril, so we need to cross-validate on a two-dimensional surface. Typically we first pick
a (relatively small) grid of values for $\lambda_2$, say (0, 0.01, 0.1, 1, 10, 100). Then, for each $\lambda_2$, the LARS
algorithm produces the entire solution path of the AdaGril. The other tuning parameter is
selected by ten-fold CV. The chosen $\lambda_2$ is the one giving the smallest CV error or generalized
cross-validation (GCV) error. However, Wang, Li and Tsai (2007) showed that for the SCAD method
(cf. Fan and Li, 2001), the BIC criterion is a better tuning parameter selector than GCV and
AIC. In our implementations the parameter $\gamma$ is fixed for the three adaptive methods (AdaEnet,
AdaLasso and AdaGril), while the pair of parameters $(\lambda_1, \lambda_2)$ is selected using the BIC criterion.

5 Numerical study 

In this section we consider some simulation experiments to evaluate the finite sample performance
of different AdaGril estimators. The AdaCnet, AdaWfusion and AdaSlasso methods correspond to
the adaptive Cnet, adaptive WFusion and adaptive Smooth-Lasso methods, respectively. These three
versions of AdaGril are compared with Lasso, AdaLasso and AdaEnet. We consider the
first simulated example used in Zou and Zhang (2009). In this example we generate data from
the model

$$y = x^t\beta^* + \varepsilon,$$

where $\beta^*$ is a vector of length $p$, $\varepsilon \sim N(0, \sigma^2)$ with $\sigma \in \{3, 6, 9\}$, and $x \sim N_p(0, R)$, where $R$ is the
correlation matrix whose $(i,j)$th element is $R_{ij} = \rho^{|i-j|}$. Results are given for $\rho = 0.5$ and
$\rho = 0.75$. This example presents a situation in which the number of parameters depends on the
sample size $n$ as follows: $p = p_n = [4n^{1/2}] - 5$ for $n = 100, 200, 1000$. The true parameter is

$$\beta^* = (1, 2, \dots, q-1, q, \underbrace{0, \dots, 0}_{p-3q}, \underbrace{3, \dots, 3}_{q}, -1, -2, \dots, -q+1, -q)^t,$$

where $q = [p_n/9]$. For this choice of $n$ and $p$, we have $\nu = 1/2$, so we used $\gamma = 3$ for calculating
the adaptive weights for all adaptive methods.
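The data-generating design of this example can be sketched in Python as follows. The brackets $[\cdot]$ are read here as the integer part, which is an assumption, and the function names `gamma_from_nu` and `simulate` are hypothetical.

```python
import math
import numpy as np

def gamma_from_nu(nu):
    """gamma = ceil(2 nu / (1 - nu)) + 1, for 0 <= nu < 1."""
    if not 0.0 <= nu < 1.0:
        raise ValueError("nu must lie in [0, 1)")
    return math.ceil(2.0 * nu / (1.0 - nu)) + 1

def simulate(n, rho, sigma, rng):
    """Draw (X, y, beta) with p = [4 sqrt(n)] - 5 and q = [p / 9]."""
    p = int(4 * math.sqrt(n)) - 5
    q = p // 9
    beta = np.concatenate([
        np.arange(1, q + 1),          # 1, 2, ..., q
        np.zeros(p - 3 * q),          # p - 3q zeros
        np.full(q, 3.0),              # q threes
        -np.arange(1, q + 1),         # -1, -2, ..., -q
    ]).astype(float)
    idx = np.arange(p)
    R = rho ** np.abs(idx[:, None] - idx[None, :])   # R_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(p), R, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

rng = np.random.default_rng(0)
X, y, beta = simulate(n=100, rho=0.5, sigma=3.0, rng=rng)
gamma = gamma_from_nu(0.5)
```

For $n = 100$ this reading gives $p = 35$ and $q = 3$, and $\nu = 1/2$ gives $\gamma = 3$, in agreement with the value used in the text.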

* * * Table 1 GOES HERE * * * 

* * * Table 2 GOES HERE * * * 

* * * Table 3 GOES HERE * * * 

Table 1, Table 2 and Table 3 summarize the performance of the different adaptive and non-adaptive
methods in terms of prediction accuracy, estimation error and variable selection, respectively.
Several observations can be made from these tables.

1. The adaptive methods outperform the non-adaptive ones in terms of prediction and estimation
accuracy, except in two small-sample cases (i.e. $n = 100$ and $200$ for $\sigma = 9$).

2. In small-sample settings, AdaSlasso is the winner in terms of prediction accuracy, followed by
AdaCnet or AdaLasso (except in [$n = 100$, $\sigma = 9$, $\rho = 0.5$], where Enet is the winner). However,
for large samples, AdaCnet is the best in terms of prediction accuracy, followed by AdaLasso (except
in the case [$\sigma = 9$, $\rho = 0.75$]).

3. AdaSlasso (or Slasso for $n = 100$ and $\sigma = 6, 9$; Table 2) seems to dominate its competitors
in terms of estimation error (i.e. MSE$_\beta$) in small-sample settings ($n = 100, 200$). It is
followed by AdaCnet or Cnet. However, AdaCnet is by far better than all other methods for the large
sample size $n = 1000$ (except the case $\sigma = 9$ and $\rho = 0.75$).

4. When the noise level $\sigma$ increases, all methods behave in the same way: their prediction and
estimation errors increase substantially, regardless of the sample size $n$ and the correlation
coefficient $\rho$.

5. From Table 3, it can be seen that the performance of all methods in terms of correct selection
increases markedly when the sample size increases, whatever the noise level, while
their performance in terms of incorrect selection increases slightly when the noise level increases,
especially in small-sample settings. Moreover, in small-sample settings the performance of the
adaptive methods is relatively similar and slightly better than that of the non-adaptive ones
(the difference between them is about 3 to 5 percent), whatever the values of $\sigma$ and $\rho$.
In large-sample settings, however, all methods behave in the same way, with a large gain in
correct selection of the relevant variables and a small advantage (3 to 5 percent)
for AdaCnet and Cnet.

Finally, we can conclude that in this example the adaptive methods perform better than the
non-adaptive ones in terms of variable selection and prediction accuracy, whatever the
values of $n$, $\sigma$ and $\rho$. Moreover, AdaCnet and AdaSlasso largely outperform AdaEnet in
almost all situations. We have also considered a second example (Example 1 in Zou and
Zhang 2009, results not reported here) where the structure of the parameter vector is smooth,
with small differences between successive coefficients. The results are still relatively similar to those
obtained in example 1, but with some advantage to AdaSlasso in prediction accuracy and to Slasso
in prediction error.

So, when the structure of the parameter vector is smooth, Slasso and AdaSlasso have a
clear advantage over their competitors. When this structure is not smooth and the coefficients
have different signs, Cnet and AdaCnet seem to work well. On the other
hand, Enet and AdaEnet give good results in the extreme correlation case ($\rho \approx 1$), while their
competitors (Cnet and Wfusion) give good results when the correlation is moderate. When the
correlation is small, Lasso or AdaLasso can do better.

6 Discussion 

In this paper we propose AdaGril for variable selection with a diverging number of parameters in
the presence of highly correlated variables. AdaGril generalizes AdaEnet by replacing
the identity matrix in the L2-norm penalty with any positive semi-definite matrix Q. Many possible
choices of Q are available in the literature. We show that, under some conditions on the eigenvalues of Q,
the results on variable selection consistency and asymptotic normality of AdaEnet
extend to AdaGril. Moreover, we show that the AdaGril estimator achieves a Sparsity Inequality, i.e.,
a bound in terms of the number of non-zero components of the 'true' regression coefficient. This
bound is obtained under a weak Restricted Eigenvalue (RE) condition similar to the one used for the Lasso.
Simulation studies show that some particular cases of AdaGril outperform their competitors
and suggest that AdaGril methods improve on both AdaLasso and AdaEnet.
The extension of AdaGril to generalized linear models (McCullagh and Nelder, 1989) will be
the subject of future work.

7 Proofs

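Proof. PROOF OF THE LEMMA Let $\hat\beta = \hat\beta(\lambda_1^*, \lambda_2) = \arg\min_\beta \{L(\lambda_1^*, \lambda_2, \beta)\}$, where

$$ L(\lambda_1^*, \lambda_2, \beta) = \|y - X\beta\|_2^2 + \lambda_2 \beta^T Q \beta + \lambda_1^* \sum_{j=1}^p w_j |\beta_j|, $$

and $Q = (q_{jk})$ is the penalty matrix. If $\hat\beta_i \hat\beta_j > 0$, then both $\hat\beta_i$ and $\hat\beta_j$ are non-zero, and we have
$\mathrm{sign}(\hat\beta_i) = \mathrm{sign}(\hat\beta_j)$. Then $\hat\beta$ must satisfy

$$ \frac{\partial L(\lambda_1^*, \lambda_2, \beta)}{\partial \beta_k}\Big|_{\beta = \hat\beta} = 0 \quad \text{if } \hat\beta_k \neq 0. $$

Hence we have

$$ -2x_i^T\{y - X\hat\beta\} + \lambda_1^* w_i\,\mathrm{sign}\{\hat\beta_i\} + 2\lambda_2 \sum_{k=1}^p q_{ik}\hat\beta_k = 0, \qquad (11) $$

and

$$ -2x_j^T\{y - X\hat\beta\} + \lambda_1^* w_j\,\mathrm{sign}\{\hat\beta_j\} + 2\lambda_2 \sum_{k=1}^p q_{jk}\hat\beta_k = 0. \qquad (12) $$

Subtracting equation (11) from (12) gives

$$ 2(x_i - x_j)^T\{y - X\hat\beta\} - \lambda_1^*(w_i - w_j)\,\mathrm{sign}\{\hat\beta_j\} - 2\lambda_2 \sum_{k=1}^p (q_{ik} - q_{jk})\hat\beta_k = 0, $$

which is equivalent to

$$ \lambda_2 \sum_{k=1}^p (q_{ik} - q_{jk})\hat\beta_k = (x_i - x_j)^T\hat r - \frac{\lambda_1^*}{2}(w_i - w_j)\,\mathrm{sign}\{\hat\beta_j\}, \qquad (13) $$

where $\hat r = y - X\hat\beta$ is the residual vector. Since $X$ is standardized, $\|x_i - x_j\|_2^2 = 2(1 - \rho_{ij})$, where $\rho_{ij} = x_i^T x_j$.
Because $\hat\beta$ is the minimizer we must have $L\{\lambda_1^*, \lambda_2, \hat\beta(\lambda_1^*, \lambda_2)\} \le L\{\lambda_1^*, \lambda_2, \beta = 0\}$, i.e.

$$ \|\hat r\|_2^2 + \lambda_2 \hat\beta^T Q \hat\beta + \lambda_1^* \sum_{k=1}^p w_k|\hat\beta_k| \le \|y\|_2^2. \qquad (14) $$

So $\|\hat r(\lambda_1^*, \lambda_2)\|_2 \le \|y\|_2$.

Now we apply the mean value theorem, as in Ghosh (2007, Proof of Theorem 3.3), to the
function $g(x) = x^{-\gamma}$: we have $|g(x) - g(y)| = |g'(c)||x - y|$ for some $c \in [\min(x, y), \max(x, y)]$.
Writing $\tilde\beta$ for the initial estimate in the weights $w_j = |\tilde\beta_j|^{-\gamma}$, we obtain

$$ |w_i - w_j| = \Big| \frac{1}{|\tilde\beta_i|^{\gamma}} - \frac{1}{|\tilde\beta_j|^{\gamma}} \Big| = \frac{\gamma}{c^{\gamma+1}}\,\big| |\tilde\beta_i| - |\tilde\beta_j| \big| \le \frac{\gamma}{\min(|\tilde\beta_i|, |\tilde\beta_j|)^{\gamma+1}}\,|\tilde\beta_i - \tilde\beta_j|, $$

where $c \in [\min(|\tilde\beta_i|, |\tilde\beta_j|), \max(|\tilde\beta_i|, |\tilde\beta_j|)]$. Then equation (13) implies that

$$ \Big| \sum_{k=1}^p (q_{ik} - q_{jk})\hat\beta_k \Big| \le \frac{1}{\lambda_2}\|\hat r(\lambda_1^*, \lambda_2)\|_2\,\|x_i - x_j\|_2 + \frac{\lambda_1^*\gamma}{2\lambda_2}\,\frac{|\tilde\beta_i - \tilde\beta_j|}{\min(|\tilde\beta_i|, |\tilde\beta_j|)^{\gamma+1}}. \qquad (15) $$

Dividing by $\|y\|_2$ and using $\|\hat r\|_2 \le \|y\|_2$, we obtain

$$ \frac{1}{\|y\|_2}\Big| \sum_{k=1}^p (q_{ik} - q_{jk})\hat\beta_k \Big| \le \frac{1}{\lambda_2}\sqrt{2(1 - \rho_{ij})} + \frac{\lambda_1^*\gamma}{2\lambda_2\|y\|_2}\,\frac{|\tilde\beta_i - \tilde\beta_j|}{\min(|\tilde\beta_i|, |\tilde\beta_j|)^{\gamma+1}}. \qquad (16) $$

On the other hand, we have:

$$ q_{ii} = 2\sum_{s \neq i} \frac{1}{1 - \rho_{is}^2} \quad \text{and} \quad q_{ij} = -\frac{2\rho_{ij}}{1 - \rho_{ij}^2} \ (i \neq j). \qquad (17) $$

Then

$$ \sum_{k=1}^p (q_{ik} - q_{jk})\hat\beta_k = \frac{2}{1 - \rho_{ij}}\,[\hat\beta_i - \hat\beta_j] + S_N, \qquad (18) $$

where $S_N = \sum_{k \neq i,j} \Big\{ \frac{2}{1 - \rho_{ki}^2}[\hat\beta_i - \rho_{ki}\hat\beta_k] + \frac{2}{1 - \rho_{kj}^2}[\rho_{kj}\hat\beta_k - \hat\beta_j] \Big\}$. If $\rho_{ki} = \rho_{kj} = \rho$ for all $k = 1, \ldots, p$, then
$S_N = \frac{2(p-2)}{1 - \rho^2}(\hat\beta_i - \hat\beta_j)$. So, using (16), we have:

$$ \frac{1}{\|y\|_2}|\hat\beta_i - \hat\beta_j| \le \frac{1 - \rho^2}{2(p + \rho - 1)\lambda_2}\Big[ \sqrt{2(1 - \rho)} + \frac{\lambda_1^*\gamma}{2\|y\|_2}\,\frac{|\tilde\beta_i - \tilde\beta_j|}{\min(|\tilde\beta_i|, |\tilde\beta_j|)^{\gamma+1}} \Big]. $$

This completes the proof. □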
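
The equal-correlation step of this proof can be checked numerically. The sketch below is only an illustration: `cnet_matrix` is a hypothetical helper, and it assumes that $Q$ is the matrix of the correlation-based penalty, with $q_{ii} = 2\sum_{s\neq i} 1/(1-\rho_{is}^2)$ and $q_{ij} = -2\rho_{ij}/(1-\rho_{ij}^2)$. Under that assumption, with all pairwise correlations equal to $\rho$, the difference of two rows of $Q$ applied to any vector reduces to $2(p+\rho-1)/(1-\rho^2)$ times the difference of the two coefficients.

```python
import numpy as np

def cnet_matrix(R):
    """Assumed penalty matrix: q_ii = 2*sum_{s!=i} 1/(1-r_is^2), q_ij = -2*r_ij/(1-r_ij^2)."""
    p = R.shape[0]
    off = ~np.eye(p, dtype=bool)
    W = np.zeros((p, p))
    W[off] = 1.0 / (1.0 - R[off] ** 2)      # 1/(1 - r^2) off the diagonal, 0 on it
    Q = -2.0 * R * W                        # off-diagonal entries; diagonal is 0 here
    Q[np.diag_indices(p)] = 2.0 * W.sum(axis=1)
    return Q

p, rho = 7, 0.4
R = np.full((p, p), rho)
np.fill_diagonal(R, 1.0)                    # equal-correlation matrix
Q = cnet_matrix(R)

rng = np.random.default_rng(0)
beta = rng.standard_normal(p)
i, j = 1, 4

lhs = (Q[i] - Q[j]) @ beta                  # sum_k (q_ik - q_jk) * beta_k
rhs = 2.0 * (p + rho - 1.0) / (1.0 - rho ** 2) * (beta[i] - beta[j])
print(np.isclose(lhs, rhs))
```

The factor grows linearly in $p$, which is what makes the bound on $|\hat\beta_i - \hat\beta_j|$ tighten as more equally correlated predictors enter the model.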

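Proof. PROOF OF THEOREM 3.2 The proof is similar to that of Theorem 3.1 in Zou and
Zhang (2009). We only need to take into account, in the various inequalities, that

$$ bn + \lambda_2 d \le \lambda_{\min}(X^T X) + \lambda_2 \lambda_{\min}(Q) \le \lambda_{\min}(X^T X + \lambda_2 Q) $$

and

$$ \lambda_{\max}(X^T X + \lambda_2 Q) \le \lambda_{\max}(X^T X) + \lambda_2 \lambda_{\max}(Q) \le Bn + \lambda_2 D. $$

□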
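
The two eigenvalue inequalities used in this proof are instances of Weyl's inequalities for sums of symmetric matrices. A minimal numerical sanity check, with randomly generated $X$ and a positive semi-definite $Q$ (both purely illustrative, not the matrices of the paper), could look as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam2 = 50, 8, 2.5

X = rng.standard_normal((n, p))
L = rng.standard_normal((p, p))
Q = L.T @ L                         # positive semi-definite by construction
G = X.T @ X                         # Gram matrix X^T X

eig = np.linalg.eigvalsh            # eigenvalues of a symmetric matrix, ascending order

lower_sum = eig(G)[0] + lam2 * eig(Q)[0]     # lambda_min(G) + lam2 * lambda_min(Q)
upper_sum = eig(G)[-1] + lam2 * eig(Q)[-1]   # lambda_max(G) + lam2 * lambda_max(Q)
lam_min = eig(G + lam2 * Q)[0]
lam_max = eig(G + lam2 * Q)[-1]

print(lower_sum <= lam_min + 1e-9, lam_max <= upper_sum + 1e-9)
```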

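Proof. PROOF OF THEOREM 3.3 To prove this theorem, we must show, as in Zou & Zhang
(2009), that $(N_A\tilde\beta^*_A, 0)$ satisfies the Karush-Kuhn-Tucker conditions of (7) with probability tending
to 1. By the definition of $\tilde\beta^*_A$, it suffices to show

$$ \Pr\Big(\forall j \in A^c,\ \Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| \le \lambda_1^*\hat w_j\Big) \to 1, $$

or equivalently

$$ \Pr\Big(\exists j \in A^c,\ \Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j\Big) \to 0. $$

The term $2\sum_{k \in A} q_{jk}\tilde\beta^*_k$ does not appear in the proof of the corresponding Theorem 4.2 for the adaptive
elastic net in Zou & Zhang (2009). So, taking this difference into account, some modifications
of the proof of Zou & Zhang's Theorem 4.2 are necessary. Let $\eta = \min_{j \in A}(|\beta^*_j|)$ and $\hat\eta =
\min_{j \in A}(|\hat\beta(\mathrm{Gril})_j|)$. We note that

$$ \Pr\Big(\exists j \in A^c,\ \Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j\Big) $$
$$ \le \sum_{j \in A^c} \Pr\Big(\Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j,\ \hat\eta > \eta/2\Big) + \Pr(\hat\eta \le \eta/2). $$

Since

$$ |\hat\beta(\mathrm{Gril})_j| \ge \eta - \|\beta^* - \hat\beta(\mathrm{Gril})\|_2 \quad \text{for all } j \in A, \qquad (19) $$

we have

$$ \hat\eta \ge \eta - \|\beta^* - \hat\beta(\mathrm{Gril})\|_2. \qquad (20) $$

If $\hat\eta \le \eta/2$, we have $\|\beta^* - \hat\beta(\mathrm{Gril})\|_2 \ge \eta/2$ and so

$$ \Pr(\hat\eta \le \eta/2) \le \Pr(\|\hat\beta(\mathrm{Gril}) - \beta^*\|_2 \ge \eta/2) \le \frac{E(\|\hat\beta(\mathrm{Gril}) - \beta^*\|_2^2)}{\eta^2/4}. $$

Then from Theorem 3.2 we have

$$ \Pr(\hat\eta \le \eta/2) \le 16\,\frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2 d)^2\,\eta^2}. \qquad (21) $$
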
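Moreover, let $M = (\lambda_1^*/n)^{1/(1+\gamma)}$. Using similar arguments as in the proof of equation (6.8)
in Zou & Zhang (2009), we have

$$ \sum_{j \in A^c} \Pr\Big(\Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j,\ \hat\eta > \eta/2\Big) $$
$$ \le \sum_{j \in A^c} \Pr\Big(\Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j,\ \hat\eta > \eta/2,\ |\hat\beta(\mathrm{Gril})_j| \le M\Big) + \sum_{j \in A^c} \Pr(|\hat\beta(\mathrm{Gril})_j| > M) $$
$$ \le \frac{4M^{2\gamma}}{\lambda_1^{*2}}\,E\Big( \sum_{j \in A^c} \Big| -x_j^T(y - X_A\tilde\beta^*_A) + \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big|^2 I(\hat\eta > \eta/2) \Big) + \frac{E(\|\hat\beta(\mathrm{Gril}) - \beta^*\|_2^2)}{M^2} $$
$$ \le \frac{4M^{2\gamma}}{\lambda_1^{*2}}\,E\Big( \sum_{j \in A^c} \Big| -x_j^T(y - X_A\tilde\beta^*_A) + \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big|^2 I(\hat\eta > \eta/2) \Big) + 4\,\frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2 d)^2 M^2}. \qquad (22) $$

We have used the result of Theorem 3.2 for the second term of the last inequality. On the other
hand, it is easy to show that

$$ \sum_{j \in A^c} \Big| -x_j^T(y - X_A\tilde\beta^*_A) + \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big|^2 \le 2\sum_{j \in A^c} |x_j^T(y - X_A\tilde\beta^*_A)|^2 + 2\sum_{j \in A^c} \Big( \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big)^2. $$

From Zou & Zhang (2009), page 16, we have

$$ \sum_{j \in A^c} |x_j^T(y - X_A\tilde\beta^*_A)|^2 \le 2Bn \cdot Bn\,\|\beta^*_A - \tilde\beta^*_A\|_2^2 + 2\sum_{j \in A^c} |x_j^T\varepsilon|^2, $$

which leads to the following inequality

$$ E\Big( \sum_{j \in A^c} |x_j^T(y - X_A\tilde\beta^*_A)|^2 I(\hat\eta > \eta/2) \Big) \le 2B^2n^2\,E\big( \|\beta^*_A - \tilde\beta^*_A\|_2^2 I(\hat\eta > \eta/2) \big) + 2Bnp\sigma^2. \qquad (23) $$

We now bound $E\big( \|\beta^*_A - \tilde\beta^*_A\|_2^2 I(\hat\eta > \eta/2) \big)$. Let

$$ \tilde\beta^*_A(\lambda_2, 0) = \arg\min_\beta \{ \|y - X_A\beta\|_2^2 + \lambda_2\beta^T Q_A\beta \}. $$

Then, as in Zou & Zhang (2009), by using the same arguments for deriving (19), we have

$$ \|\tilde\beta^*_A - \tilde\beta^*_A(\lambda_2, 0)\|_2 \le \frac{\lambda_1^*\sqrt{\sum_{j \in A}\hat w_j^2}}{\lambda_{\min}(X_A^T X_A) + \lambda_2 d} \le \frac{\lambda_1^*\,\hat\eta^{-\gamma}\sqrt{p}}{bn + \lambda_2 d}. \qquad (24) $$

Following arguments similar to those used in the proof of Theorem 3.2, we obtain

$$ E\big( \|\beta^*_A - \tilde\beta^*_A\|_2^2 I(\hat\eta > \eta/2) \big) \le 4\,\frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + \lambda_1^{*2}(\eta/2)^{-2\gamma}p + Bpn\sigma^2}{(bn + \lambda_2 d)^2}. \qquad (25) $$

Now we bound the second term $\sum_{j \in A^c}\big( \sum_{k \in A} q_{jk}\tilde\beta^*_k \big)^2$. In fact, we have

$$ \sum_{j \in A^c}\Big( \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big)^2 \le \sum_{j \in A^c}\Big( \sum_{k \in A} q_{jk}^2 \Big)\Big( \sum_{k \in A} \tilde\beta^{*2}_k \Big) \quad \text{(Cauchy-Schwarz inequality)} $$
$$ \le \|\tilde\beta^*_A\|_2^2 \sum_{j=1}^p \sum_{k=1}^p q_{jk}^2 $$
$$ = \|\tilde\beta^*_A\|_2^2\,\|Q\|_F^2 \quad (\|\cdot\|_F \text{ is the Frobenius norm}) $$
$$ \le p\,\|\tilde\beta^*_A\|_2^2\,\|Q\|_2^2 \quad (\|\cdot\|_2 \text{ is the spectral norm}) $$
$$ \le pD^2\|\tilde\beta^*_A\|_2^2 $$
$$ \le 2pD^2\|\tilde\beta^*_A - \beta^*_A\|_2^2 + 2pD^2\|\beta^*_A\|_2^2. $$

It leads to the following inequality

$$ E\Big( \sum_{j \in A^c}\Big( \sum_{k \in A} q_{jk}\tilde\beta^*_k \Big)^2 I(\hat\eta > \eta/2) \Big) \le 2pD^2\,E\big[ \|\tilde\beta^*_A - \beta^*_A\|_2^2 I(\hat\eta > \eta/2) \big] + 2pD^2\|\beta^*_A\|_2^2. \qquad (26) $$

The combination of (21), (22), (23) and (26), together with (25), yields

$$ \Pr\Big(\exists j \in A^c,\ \Big| -2x_j^T(y - X_A\tilde\beta^*_A) + 2\sum_{k \in A} q_{jk}\tilde\beta^*_k \Big| > \lambda_1^*\hat w_j\Big) $$
$$ \le \frac{64M^{2\gamma}(B^2n^2 + pD^2)}{\lambda_1^{*2}}\cdot\frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + \lambda_1^{*2}(\eta/2)^{-2\gamma}p + Bpn\sigma^2}{(bn + \lambda_2 d)^2} + \frac{16M^{2\gamma}}{\lambda_1^{*2}}\big( Bnp\sigma^2 + pD^2\|\beta^*_A\|_2^2 \big) $$
$$ \quad + \frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2 d)^2}\cdot\frac{4}{M^2} + \frac{\lambda_2^2 D^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2 d)^2}\cdot\frac{16}{\eta^2} $$
$$ \equiv L_1 + L_2 + L_3 + L_4. $$

We have chosen $\gamma > 2\nu/(1 - \nu)$; then under conditions (C1)-(C6) it follows that each of the terms
$L_1$, $L_2$, $L_3$ and $L_4$ tends to zero.
□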
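
The Cauchy-Schwarz and norm-comparison chain used in this proof to control the $Q$-dependent term can be checked on random inputs. In the sketch below the matrix $Q$, the index set $A$ and the vector `b` (standing in for the coefficients on $A$) are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
p, s = 12, 4

L = rng.standard_normal((p, p))
Q = L.T @ L                               # symmetric positive semi-definite
A = np.arange(s)                          # illustrative active set
Ac = np.arange(s, p)                      # its complement
b = rng.standard_normal(s)                # stands in for the coefficients on A

# Left-hand side: sum over j in A^c of (sum over k in A of q_jk * b_k)^2
lhs = np.sum((Q[np.ix_(Ac, A)] @ b) ** 2)

fro2 = np.linalg.norm(Q, "fro") ** 2      # squared Frobenius norm
spec2 = np.linalg.norm(Q, 2) ** 2         # squared spectral norm

step_frobenius = np.sum(b ** 2) * fro2        # after Cauchy-Schwarz
step_spectral = p * np.sum(b ** 2) * spec2    # using ||Q||_F^2 <= p * ||Q||_2^2

print(lhs <= step_frobenius <= step_spectral)
```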



Proof. PROOF OF THEOREM O The proof for selection consistency of the AdaGril is exactly 
similar to the proof of selection consistency of AdaEnet (cf. pages 1748-1749 in Zou and Zhang 
2009). 

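We now prove the asymptotic normality. For convenience we put

$$ z_n = \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big) N_A^{-1}\big( \hat\beta(\mathrm{AdaGril})_A - \beta^*_A \big). $$

Note that

$$ \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \tilde\beta^*_A - N_A^{-1}\beta^*_A \big) = \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \beta^*_A - N_A^{-1}\beta^*_A \big) $$
$$ \quad + \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \tilde\beta^*_A - \tilde\beta^*_A(\lambda_2, 0) \big) + \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \tilde\beta^*_A(\lambda_2, 0) - \beta^*_A \big). $$

In addition, we have

$$ \big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \tilde\beta^*_A(\lambda_2, 0) - \beta^*_A \big) = -\lambda_2\Sigma_A^{-1}Q_A\beta^*_A + \Sigma_A^{-1}X_A^T\varepsilon. $$

Therefore, by Theorem 3.3 it follows that with probability tending to 1, $z_n = T_1 + T_2 + T_3$, where

$$ T_1 = \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)K_A\beta^*_A - \alpha^T\lambda_2\Sigma_A^{-1/2}Q_A\beta^*_A, \quad \text{where } K_A = I - N_A^{-1} = \mathrm{diag}\Big( \frac{\lambda_2 q_{ii}}{n + \lambda_2 q_{ii}} \Big)_{i \in A}, $$
$$ T_2 = \alpha^T\Sigma_A^{1/2}\big( I + \lambda_2\Sigma_A^{-1}Q_A \big)\big( \tilde\beta^*_A - \tilde\beta^*_A(\lambda_2, 0) \big), $$
$$ T_3 = \alpha^T\Sigma_A^{-1/2}X_A^T\varepsilon. $$

We now show that $T_1 = o(1)$, $T_2 = o_p(1)$ and $T_3 \to_d N(0, \sigma^2)$. Then by Slutsky's theorem
we know $z_n \to_d N(0, \sigma^2)$. From (C1) and the fact that $\alpha^T\alpha = 1$, we have

$$ T_1^2 \le 2\|\Sigma_A^{1/2}( I + \lambda_2\Sigma_A^{-1}Q_A )K_A\beta^*_A\|_2^2 + 2\|\lambda_2\Sigma_A^{-1/2}Q_A\beta^*_A\|_2^2 $$
$$ \le 2\|K_A\|_2^2\,\|\Sigma_A^{1/2}( I + \lambda_2\Sigma_A^{-1}Q_A )\|_2^2\,\|\beta^*_A\|_2^2 + 2\lambda_2^2\|\Sigma_A^{-1/2}Q_A\|_2^2\,\|\beta^*_A\|_2^2 $$
$$ \le 2\max_{i \in A}\Big( \frac{\lambda_2 q_{ii}}{n + \lambda_2 q_{ii}} \Big)^2\|\Sigma_A^{1/2}\|_2^2\,\|( I + \lambda_2\Sigma_A^{-1}Q_A )\|_2^2\,\|\beta^*_A\|_2^2 + 2\lambda_2^2\|\Sigma_A^{-1/2}\|_2^2\,\|Q_A\|_2^2\,\|\beta^*_A\|_2^2 $$
$$ \le 2\max_{i \in A}\Big( \frac{\lambda_2 q_{ii}}{n + \lambda_2 q_{ii}} \Big)^2\|\Sigma_A^{1/2}\|_2^2\big( \|I\|_2 + \lambda_2\|\Sigma_A^{-1}\|_2\|Q_A\|_2 \big)^2\|\beta^*_A\|_2^2 + 2\lambda_2^2\|\Sigma_A^{-1/2}\|_2^2\,\|Q_A\|_2^2\,\|\beta^*_A\|_2^2 $$
$$ \le 2\max_{i \in A}\Big( \frac{\lambda_2 q_{ii}}{n + \lambda_2 q_{ii}} \Big)^2 Bn\Big( 1 + \frac{\lambda_2 D}{bn} \Big)^2\|\beta^*_A\|_2^2 + 2\lambda_2^2\,\frac{1}{bn}\,D^2\|\beta^*_A\|_2^2. $$

To obtain the previous inequalities we have used sub-multiplicativity and consistency of the $\|\cdot\|_2$
matrix norms. Hence it follows from (C5) and (C6) that $T_1 = o(1)$. Similarly, we can bound $T_2$
as follows

$$ T_2^2 \le \|\Sigma_A^{1/2}\|_2^2\,\|( I + \lambda_2\Sigma_A^{-1}Q_A )\|_2^2\,\|\tilde\beta^*_A - \tilde\beta^*_A(\lambda_2, 0)\|_2^2 \le Bn\Big( 1 + \frac{\lambda_2 D}{bn} \Big)^2\Big( \frac{\lambda_1^*\,\hat\eta^{-\gamma}\sqrt{p}}{bn + \lambda_2 d} \Big)^2, $$

where we have used the bound on $\|\tilde\beta^*_A - \tilde\beta^*_A(\lambda_2, 0)\|_2$ from the proof of Theorem 3.3 in the last step.
Then $T_2 = o_p(1)$. Finally, following the same
arguments used in Theorem 3.3 of Zou & Zhang (2009), we obtain that $T_3 \to_d N(0, \sigma^2)$. This
completes the proof. □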
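
The identity for the ridge-type estimator $\tilde\beta^*_A(\lambda_2, 0)$ used in this proof can be derived in a few lines. The derivation below is a sketch under two assumptions that are not spelled out in this section: $\Sigma_A = X_A^T X_A$ is invertible, and $\tilde\beta^*_A(\lambda_2, 0) = (\Sigma_A + \lambda_2 Q_A)^{-1}X_A^T y$, which is the first-order condition of the quadratic problem defining it. It also uses $y = X_A\beta^*_A + \varepsilon$, since $\beta^*$ vanishes outside $A$.

```latex
\begin{align*}
\big(I + \lambda_2\Sigma_A^{-1}Q_A\big)\big(\tilde\beta^*_A(\lambda_2,0) - \beta^*_A\big)
 &= \Sigma_A^{-1}(\Sigma_A + \lambda_2 Q_A)\,\tilde\beta^*_A(\lambda_2,0)
    - \Sigma_A^{-1}(\Sigma_A + \lambda_2 Q_A)\,\beta^*_A \\
 &= \Sigma_A^{-1}X_A^T y - \beta^*_A - \lambda_2\Sigma_A^{-1}Q_A\beta^*_A \\
 &= \Sigma_A^{-1}X_A^T(X_A\beta^*_A + \varepsilon) - \beta^*_A - \lambda_2\Sigma_A^{-1}Q_A\beta^*_A \\
 &= -\lambda_2\Sigma_A^{-1}Q_A\beta^*_A + \Sigma_A^{-1}X_A^T\varepsilon .
\end{align*}
```

The first term is the deterministic bias induced by the quadratic penalty, and the second is the noise term that drives the Gaussian limit.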

Proof. PROOF OF THEOREM [33) Now, we consider the Adaptive Gril estimator, with Gril 
estimates as initial weights. The adaptive Gril estimates are defined by 

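$$ \hat\beta(\mathrm{AdaGril}) = \arg\min_\beta \Big\{ \|y - X\beta\|_2^2 + \lambda_2\beta^T Q\beta + \lambda_1^*\sum_{j=1}^p \hat w_j|\beta_j| \Big\}, \qquad (27) $$

where $\hat w_j = \max\{ 1/|\hat\beta(\mathrm{Gril})_j|,\ 1 \}$.

Then, the minimizer of (27) is also the minimizer of the adaptive Lasso problem on the augmented
data $(\tilde y, \tilde X)$ defined in (5). So, we have

$$ \|\tilde y - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \lambda_1^*\sum_{j=1}^p \hat w_j|\hat\beta(\mathrm{AdaGril})_j| \le \|\tilde y - \tilde X\beta^*\|_2^2 + \lambda_1^*\sum_{j=1}^p \hat w_j|\beta^*_j|. $$

Since $\tilde y = \tilde X\beta^* + \tilde\varepsilon$, the latter is equivalent to

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 \le \lambda_1^*\sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) + 2\tilde\varepsilon^T\tilde X\big( \hat\beta(\mathrm{AdaGril}) - \beta^* \big). \qquad (28) $$

Using the definition of $\tilde X$ and $\tilde\varepsilon$ in the last term of the latter inequality, we have

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 \le \lambda_1^*\sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) + 2\varepsilon^T X\big( \hat\beta(\mathrm{AdaGril}) - \beta^* \big) $$
$$ \quad - 2\lambda_2\beta^{*T}Q\big( \hat\beta(\mathrm{AdaGril}) - \beta^* \big). \qquad (29) $$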
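
The passage from the augmented data to the original data can be verified numerically. The sketch below assumes the usual augmentation $\tilde X = (X^T, \sqrt{\lambda_2}L^T)^T$ and $\tilde y = (y^T, 0^T)^T$ with $L^T L = Q$; the exact definition in (5) is not reproduced in this section, so this form, and all variable names, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam2 = 30, 6, 0.7

X = rng.standard_normal((n, p))
L = rng.standard_normal((p, p))
Q = L.T @ L                                   # penalty matrix, Q = L^T L
beta_star = rng.standard_normal(p)
b = rng.standard_normal(p)                    # any candidate coefficient vector
eps = rng.standard_normal(n)
y = X @ beta_star + eps

# Assumed augmentation: stack sqrt(lam2) * L under X and zeros under y.
X_aug = np.vstack([X, np.sqrt(lam2) * L])
y_aug = np.concatenate([y, np.zeros(p)])
eps_aug = y_aug - X_aug @ beta_star           # augmented noise

# Penalized least squares on (y, X) equals plain least squares on the augmented data.
penalized = np.sum((y - X @ b) ** 2) + lam2 * b @ Q @ b
augmented = np.sum((y_aug - X_aug @ b) ** 2)

# Cross term: the augmented noise contributes an extra term that depends on Q.
cross_aug = eps_aug @ X_aug @ (b - beta_star)
cross = eps @ X @ (b - beta_star) - lam2 * beta_star @ Q @ (b - beta_star)

print(np.isclose(penalized, augmented), np.isclose(cross_aug, cross))
```

The second identity is the one that introduces the $Q$-dependent term when the cross term is rewritten in terms of $X$ and $\varepsilon$.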

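Obviously, we have

$$ \sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) \le \sum_{j=1}^p \hat w_j|\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| \qquad (30) $$

and

$$ \big| 2\lambda_2\beta^{*T}Q\big( \beta^* - \hat\beta(\mathrm{AdaGril}) \big) \big| \le 2\lambda_2\|Q\beta^*\|_\infty\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1. \qquad (31) $$

For $0 < \theta \le 1$, consider the set $\Gamma = \{ \max_{j=1,\ldots,p} 2|U_j| \le \theta\lambda_1^* \}$ with $U_j = \sum_{i=1}^n x_{ij}\varepsilon_i$.
Therefore, on the set $\Gamma$ and using (29), (30) and (31), we obtain

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 \le \lambda_1^*\sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) + \theta\lambda_1^*\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 $$
$$ \quad + 2\lambda_2\|Q\beta^*\|_\infty\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1. \qquad (32) $$

Taking $\theta = 1/4$ and $\lambda_2 = \lambda_1^*/(8\|Q\beta^*\|_\infty)$, and adding $2^{-1}\lambda_1^*\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1$ to both sides of the
previous inequality, we have

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \frac{\lambda_1^*}{2}\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le \lambda_1^*\sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) $$
$$ \quad + \lambda_1^*\sum_{j=1}^p |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j|. \qquad (33) $$

Since $\hat w_j = \max\{ 1/|\hat\beta(\mathrm{Gril})_j|,\ 1 \} \ge 1$ for all $j = 1, \ldots, p$, we have

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \frac{\lambda_1^*}{2}\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le \lambda_1^*\sum_{j=1}^p \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| \big) $$
$$ \quad + \lambda_1^*\sum_{j=1}^p \hat w_j|\beta^*_j - \hat\beta(\mathrm{AdaGril})_j|. \qquad (34) $$

Let $\delta = 8\psi^{-2}\lambda_1^* s$, $\eta = \min_{j \in A}(|\beta^*_j|)$, $\hat w_{\max}(A) = \max_{j \in A}\hat w_j$. Hebiri and Van De Geer (2010) show
that on $\Gamma$

$$ \delta \ge \|\beta^* - \hat\beta(\mathrm{Gril})\|_\infty. $$

Suppose now that $\eta > 2\delta$; then we have $\eta > 2\delta \ge 2\|\beta^* - \hat\beta(\mathrm{Gril})\|_\infty$, and

$$ |\hat\beta(\mathrm{Gril})_j| \ge \eta - \|\beta^* - \hat\beta(\mathrm{Gril})\|_\infty \ge \frac{\eta}{2} \quad \text{for all } j \in A. \qquad (35) $$

Hence, we deduce that

$$ \hat w_{\max}(A) \le \max\Big\{ \frac{2}{\eta},\ 1 \Big\}. \qquad (36) $$

Using the fact that $|\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| + |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| = 0$ for all $j \in A^c$, the triangle
inequality and (36), we have

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \frac{\lambda_1^*}{2}\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 $$
$$ \le \lambda_1^*\sum_{j \in A} \hat w_j\big( |\beta^*_j| - |\hat\beta(\mathrm{AdaGril})_j| + |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| \big) $$
$$ \le 2\lambda_1^*\sum_{j \in A} \hat w_j|\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| $$
$$ \le 2\lambda_1^*\,\hat w_{\max}(A)\sum_{j \in A} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| $$
$$ \le 2\lambda_1^*\max\Big\{ \frac{2}{\eta},\ 1 \Big\}\sum_{j \in A} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j|. \qquad (37) $$

Since $\sum_{j \in A} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| \le \sqrt{s}\,\|\beta^*_A - \hat\beta(\mathrm{AdaGril})_A\|_2$, we obtain that

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \frac{\lambda_1^*}{2}\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le 2\sqrt{s}\,\lambda_1^*\max\Big\{ \frac{2}{\eta},\ 1 \Big\}\|\beta^*_A - \hat\beta(\mathrm{AdaGril})_A\|_2. \qquad (38) $$

According to inequality (37), we have

$$ \|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le 4\max\Big\{ \frac{2}{\eta},\ 1 \Big\}\sum_{j \in A} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j|. \qquad (39) $$

So

$$ \sum_{j \in A^c} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j| \le 4\max\Big\{ \frac{2}{\eta},\ 1 \Big\}\sum_{j \in A} |\beta^*_j - \hat\beta(\mathrm{AdaGril})_j|. \qquad (40) $$

This last inequality shows that $\beta^* - \hat\beta(\mathrm{AdaGril})$ satisfies the assumption RE, and hence

$$ \|\beta^*_A - \hat\beta(\mathrm{AdaGril})_A\|_2 \le \|\beta^* - \hat\beta(\mathrm{AdaGril})\|_2 \le \psi^{-1}\|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2. \qquad (41) $$

The combination of this last inequality with (38) gives us

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 + \frac{\lambda_1^*}{2}\|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le 2\psi^{-1}\lambda_1^*\sqrt{s}\,\max\Big\{ \frac{2}{\eta},\ 1 \Big\}\|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2. \qquad (42) $$

So

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 \le 4\psi^{-2}\lambda_1^{*2}\Big( \max\Big\{ \frac{2}{\eta},\ 1 \Big\} \Big)^2 s. \qquad (43) $$

Since

$$ \|\tilde X\beta^* - \tilde X\hat\beta(\mathrm{AdaGril})\|_2^2 = \|X\beta^* - X\hat\beta(\mathrm{AdaGril})\|_2^2 + \lambda_2\big( \beta^* - \hat\beta(\mathrm{AdaGril}) \big)^T Q\big( \beta^* - \hat\beta(\mathrm{AdaGril}) \big), \qquad (44) $$

we obtain

$$ \|X\beta^* - X\hat\beta(\mathrm{AdaGril})\|_2^2 \le 4\psi^{-2}\lambda_1^{*2}\Big( \max\Big\{ \frac{2}{\eta},\ 1 \Big\} \Big)^2 s. \qquad (45) $$

Using (38), (41), (43) and the fact that $\|v\|_\infty \le \|v\|_1$ for all $v \in \mathbb{R}^p$, we have

$$ \|\beta^* - \hat\beta(\mathrm{AdaGril})\|_1 \le 8\psi^{-2}\lambda_1^*\Big( \max\Big\{ \frac{2}{\eta},\ 1 \Big\} \Big)^2 s, \qquad (46) $$

and

$$ \|\beta^* - \hat\beta(\mathrm{AdaGril})\|_\infty \le 8\psi^{-2}\lambda_1^*\Big( \max\Big\{ \frac{2}{\eta},\ 1 \Big\} \Big)^2 s. \qquad (47) $$

$\sigma = 3$          $\sigma = 6$          $\sigma = 9$
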
                  rho = 0.5  rho = 0.75  rho = 0.5  rho = 0.75  rho = 0.5  rho = 0.75

n = 100
  (label lost)       1.74       1.73       8.38       9.61      22.95      19.26
  AdaWfusion         1.89       1.94       8.80       8.77      21.29      18.30

n = 200
  (label lost)       1.74       1.62       7.91       7.26      16.31      15.02
  AdaLasso           1.04       0.99       5.06       5.54      13.04      14.80

n = 1000
  (label lost)       0.82       0.68       3.33       2.72       7.27       6.39
  AdaWfusion         0.60       0.53       2.39       2.24       5.55       4.93

Entries that survive without a recoverable column position:
n = 100: 2.51, 10.52, 24.50, 19.53; Enet: 2.32, 8.88, 19.89, 17.52; AdaEnet: 1.89, 1.83, 8.90, 21.71, 17.55.
n = 200: Enet: 7.23, 6.67, 14.63, 13.92.
n = 1000: 6.39, 1.32, 7.10, 5.19, 9.81, 1.92, 1.89, 4.57.

Table 1: Median mean-squared errors for $\rho \in \{0.5, 0.75\}$, $\sigma \in \{3, 6, 9\}$ and $n \in \{100, 200, 1000\}$,
based on 100 replications.

7.62

Table 2: MSE$_\beta = \|\hat\beta - \beta^*\|_2^2$ errors for $\rho \in \{0.5, 0.75\}$, $\sigma \in \{3, 6, 9\}$ and $n \in \{100, 200, 1000\}$,
based on 100 replications.

Table 3: The median number of C and IC for $\rho = 0.5$, $\sigma \in \{3, 6, 9\}$ and $n \in \{100, 200, 1000\}$, based
on 100 replications.